[alpaka] Caching allocators for host and device #260
waredjeb wants to merge 1 commit into cms-patatrack:master from
Conversation
fwyzard left a comment:
First round of comments...
```cpp
 * and sets a maximum of 6,291,455 cached bytes per device
 *
 */
struct CachingDeviceAllocator {
```
This class should be templated on the Device type:
```diff
-struct CachingDeviceAllocator {
+template <typename TDevice>
+struct CachingDeviceAllocator {
+public:
+  using Device = TDevice;
```
```cpp
BlockDescriptor(unsigned int block_bin,
                size_t block_bytes,
                size_t bytes_requested,
                const ::ALPAKA_ACCELERATOR_NAMESPACE::Device& device)
```
::ALPAKA_ACCELERATOR_NAMESPACE::Device should be replaced by the Device template type:
```diff
-                const ::ALPAKA_ACCELERATOR_NAMESPACE::Device& device)
+                Device const& device)
```
```cpp
      bin{block_bin} {}

// Constructor (suitable for searching maps for a specific block, given a device buffer)
BlockDescriptor(::ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<std::byte> buffer)
```
::ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<std::byte> should be replaced by a buffer type that depends on the Device template parameter:
```diff
-BlockDescriptor(::ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<std::byte> buffer)
+using DeviceBuffer = alpaka::Buf<Device, std::byte, Dim1D, Idx>;
+...
+BlockDescriptor(DeviceBuffer buffer)
```
```cpp
using CachedBlocks = std::unordered_multiset<BlockDescriptor, BlockHashByBytes, BlockEqualByBytes>;

/// Set type for live blocks (hashed by ptr)
using BusyBlocks = std::unordered_multiset<BlockDescriptor, BlockHashByPtr, BlockEqualByPtr>;
```
Was the performance of std::unordered_multiset compared against the original std::multiset for this case?
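Independent of raw lookup speed, one functional difference may matter here: an ordered multiset supports a best-fit search via lower_bound, while a hash-based container can only answer exact-match queries. A standalone sketch (the container contents are illustrative, not from the PR):

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Cached block sizes, kept ordered. lower_bound finds the smallest cached
// block that can satisfy a request, which std::unordered_multiset cannot
// do directly (it would need one exact-match probe per candidate bin).
std::multiset<std::size_t> cached = {256, 1024, 1024, 4096};

// Smallest cached block >= 512 bytes.
auto best_fit = cached.lower_bound(512);
```

If the allocator always rounds requests up to fixed power-of-two bins, exact-match hashing may be sufficient and this difference moot.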
```cpp
    device)  ///< [in] The device to be associated with this allocation
{
  std::unique_lock<std::mutex> mutex_locker(mutex, std::defer_lock);
  int device_idx = getIdxOfDev(device);
```
There is already a function with similar functionality in https://github.com/cms-patatrack/pixeltrack-standalone/blob/master/src/alpaka/AlpakaCore/getDevIndex.h
```diff
-int device_idx = getIdxOfDev(device);
+int device_idx = getDevIndex(device);
```
(I won't repeat this comment)
```cpp
if (!found) {
  search_key.buf = alpaka::allocBuf<std::byte, alpaka_common::Idx>(
      device, static_cast<alpaka_common::Extent>(search_key.bytes));
#if CUDA_VERSION >= 11020
```
The codebase already requires CUDA >= 11.2, is this check really needed?
```cpp
search_key.buf = alpaka::allocBuf<std::byte, alpaka_common::Idx>(
    device, static_cast<alpaka_common::Extent>(search_key.bytes));
#if CUDA_VERSION >= 11020
alpaka::prepareForAsyncCopy(search_key.buf);
```
What does prepareForAsyncCopy() do? I'd guess it calls cudaHostRegister() or similar under the hood for host memory.
Looking at the code I see this is a no-op for CUDA/HIP buffers
https://github.com/alpaka-group/alpaka/blob/ee525dbb1c3e71490a17df80e7d13b3067619b95/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp#L408-L418
(so in principle this call would not be needed for the device allocator)
For host buffers it indeed ends up calling cudaHostRegister()
https://github.com/alpaka-group/alpaka/blob/ee525dbb1c3e71490a17df80e7d13b3067619b95/include/alpaka/mem/buf/BufCpu.hpp#L363-L378
https://github.com/alpaka-group/alpaka/blob/ee525dbb1c3e71490a17df80e7d13b3067619b95/include/alpaka/mem/buf/BufCpu.hpp#L269-L303
But given the guard

```cpp
#if (defined(ALPAKA_ACC_GPU_CUDA_ENABLED) && BOOST_LANG_CUDA) || (defined(ALPAKA_ACC_GPU_HIP_ENABLED) && BOOST_LANG_HIP)
```

the cudaHostRegister() is called only if the source file including this Alpaka header is compiled with ALPAKA_ACC_GPU_CUDA_ENABLED and compiled with nvcc (which we do when that macro is enabled). I'm afraid mixing that with code using BufCpu.hpp compiled without ALPAKA_ACC_GPU_CUDA_ENABLED would technically violate the ODR.
I also see the definition of alpaka::detail::BufCpuImpl depends on these macros, in particular whether the m_bPinned member exists or not
https://github.com/alpaka-group/alpaka/blob/ee525dbb1c3e71490a17df80e7d13b3067619b95/include/alpaka/mem/buf/BufCpu.hpp#L103-L105
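A minimal, hypothetical illustration of the hazard (not alpaka code): when a class definition changes with a compile-time macro, two translation units built with different flags see different definitions under the same name, which violates the one-definition rule if they are linked together.

```cpp
#include <cassert>

// Hypothetical stand-in for BufCpuImpl: its definition depends on a macro,
// so a TU compiled with ALPAKA_ACC_GPU_CUDA_ENABLED and one compiled
// without it disagree about this type's behavior (and, with conditional
// members like m_bPinned, even about its layout) -- an ODR violation.
struct BufImpl {
  bool pinned() const {
#if defined(ALPAKA_ACC_GPU_CUDA_ENABLED)
    return true;   // buffer registered via cudaHostRegister()
#else
    return false;  // plain pageable host memory
#endif
  }
};
```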
```cpp
 */
auto DeviceAllocate(size_t bytes,  ///< [in] Minimum no. of bytes for the allocation
                    const ::ALPAKA_ACCELERATOR_NAMESPACE::Device&
                        device)  ///< [in] The device to be associated with this allocation
```
What is the plan towards an asynchronous (in a sense, garbage-collecting) API?
Do all scopes creating and destroying unique_ptrs have alpaka::wait() before the end of scope?
```cpp
inline int getIdxOfDev(const ::ALPAKA_ACCELERATOR_NAMESPACE::Device& device) {
  static const auto devices{alpaka::getDevs<::ALPAKA_ACCELERATOR_NAMESPACE::Platform>()};
  return (std::find(devices.begin(), devices.end(), device) - devices.begin());
}
```
(just repeating) Given https://github.com/cms-patatrack/pixeltrack-standalone/blob/master/src/alpaka/AlpakaCore/getDevIndex.h this is not needed.
```cpp
template <typename TData>
class DeviceDeleter {
public:
  DeviceDeleter(::ALPAKA_ACCELERATOR_NAMESPACE::AlpakaDeviceBuf<TData> buffer) : buf{std::move(buffer)} {}
```
This class would need to be templated over Device.
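A possible direction, sketched with placeholder types since the real alpaka buffer and device types are not shown in this diff:

```cpp
#include <memory>
#include <utility>

// Placeholder for an alpaka device buffer tied to a device type; a shared
// handle stands in for the real buffer's ownership semantics.
template <typename TDevice, typename TData>
struct MockDeviceBuf {
  std::shared_ptr<TData> handle;
};

// Deleter templated over the device as well as the element type, so that a
// CachingDeviceAllocator<TDevice> can hand out matching unique_ptrs.
template <typename TDevice, typename TData>
class DeviceDeleter {
public:
  explicit DeviceDeleter(MockDeviceBuf<TDevice, TData> buffer) : buf_{std::move(buffer)} {}

  void operator()(TData*) {
    // Nothing to do with the raw pointer itself: releasing the held buffer
    // returns the memory to its allocator.
    buf_.handle.reset();
  }

private:
  MockDeviceBuf<TDevice, TData> buf_;
};
```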
```cpp
template <typename TData>
using unique_ptr =
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
    std::unique_ptr<TData,
                    impl::DeviceDeleter<
                        std::conditional_t<allocator::policy == allocator::Policy::Caching, std::byte, TData>>>;
#else
    host::unique_ptr<TData>;
#endif
}  // namespace device
```
Technically this violates ODR. Templating the unique_ptr also over Device and providing a specialization for DevCudaRt if ALPAKA_ACC_GPU_CUDA_ENABLED is defined would work. In longer term, I think, we should think about hiding the exact deleter type from the type of the unique_ptr (or just go directly to shared_ptr semantics).
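One way to hide the exact deleter type, sketched here with std::function (which costs an allocation and an indirection, and requires a copyable callable, which already nudges towards shared_ptr-like handles); all names are illustrative:

```cpp
#include <functional>
#include <memory>
#include <utility>

// The unique_ptr type no longer encodes a backend-specific deleter, so it
// is the same type regardless of which accelerator macro is enabled.
template <typename TData>
using device_unique_ptr = std::unique_ptr<TData, std::function<void(TData*)>>;

// Wrap any copyable buffer handle into the type-erased deleter.
template <typename TData, typename TBuffer>
device_unique_ptr<TData> make_device_unique(TData* ptr, TBuffer buffer) {
  return device_unique_ptr<TData>(ptr, [buf = std::move(buffer)](TData*) {
    // Dropping the captured handle returns the memory to its allocator.
  });
}
```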
```cpp
if constexpr (allocator::policy == allocator::Policy::Asynchronous) {
  alpaka::prepareForAsyncCopy(buf);
}
```
The behavior of the Asynchronous policy would be different from CUDA's, right?
Reimplemented in #301.
Caching Allocator and Async Allocator Enabled
Disabling the Caching Allocator and the Async Allocator:

```shell
make alpaka -j 10 CUDA_BASE=/usr/local/cuda-11.2 USER_CXXFLAGS="-DALPAKA_DISABLE_CACHING_ALLOCATOR -DALPAKA_DISABLE_ASYNC_ALLOCATOR"
```
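A sketch of how these flags might select the allocator policy at compile time; the enum values and the default precedence here are assumptions for illustration, not the PR's actual code:

```cpp
#include <cassert>

namespace allocator {
  enum class Policy { Synchronous, Asynchronous, Caching };

  // Assumed precedence: caching by default, falling back to the async
  // allocator, then to plain synchronous allocation, as each is disabled.
#if !defined(ALPAKA_DISABLE_CACHING_ALLOCATOR)
  constexpr Policy policy = Policy::Caching;
#elif !defined(ALPAKA_DISABLE_ASYNC_ALLOCATOR)
  constexpr Policy policy = Policy::Asynchronous;
#else
  constexpr Policy policy = Policy::Synchronous;
#endif
}  // namespace allocator
```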